COVID-19 Vaccines by County

Step 2: Spatial Modeling of the Logistic Growth Curve Parameters with INLA

Authors
Affiliations
Maciej Rysz

Farmer School of Business, Miami University

Longwen Zhao

College for Public Health and Social Justice, Saint Louis University

Fadel M. Megahed

Farmer School of Business, Miami University

Allison Jones-Farmer

Farmer School of Business, Miami University

Steve Rigdon

College for Public Health and Social Justice, Saint Louis University

Published

September 1, 2022

1 Objectives of this Document

This is the fourth Markdown document in our analysis workflow (see step0_extract_and_eda.html, step0_transform.html, and step1_logistic_growth.html for our previous analyses). The main objective of this document is to fit spatial models, using INLA, that explain the variability in the estimated logistic growth curve parameters across U.S. counties, and to evaluate the predictive performance of these models.


2 R Setup and Required Packages

In this project, the open-source R programming language is used to model the COVID-19 percent fully-vaccinated progression in different U.S. counties. R is maintained by an international team of developers who make the language available at The Comprehensive R Archive Network (CRAN). Readers interested in reusing our code and reproducing our results should have R installed locally on their machines. R can be installed on a number of different operating systems (see Windows, Mac, and Linux for the installation instructions for these systems). We also recommend using the RStudio interface for R. The reader can download RStudio for free by following the instructions at the link. For non-R users, we recommend the Hands-on Programming with R book for a brief overview of the software's functionality. Hereafter, we assume that the reader has an introductory understanding of the R programming language.

In the code chunk below, we load the packages used to support our analysis. Our input and output files can also be accessed/downloaded from the fmegahed/vaccines_spatial_and_optimization GitHub repository.

# create a files directory if it does not exist
if (!dir.exists('step2_spatial_model_files')) {dir.create('step2_spatial_model_files')}

# Check and install if these packages are not found locally on machine
if(require(pacman)==FALSE) install.packages("pacman"); library(pacman)
if(require(devtools)==FALSE) install.packages("devtools")

if(require(urbnmapr)==FALSE) devtools::install_github('UrbanInstitute/urbnmapr'); library(urbnmapr)
if(require(albersusa)==FALSE) devtools::install_github("hrbrmstr/albersusa"); library(albersusa)

# INLA
if(require(INLA) == FALSE){
  install.packages(
    "INLA", repos=c(getOption("repos"), INLA="https://inla.r-inla-download.org/R/stable"),
    dep=TRUE)
}; library(INLA)

# ggregplot and INLAutils for visualizing and manipulating INLA output
if('ggregplot' %in% rownames(installed.packages()) == FALSE) devtools::install_github('gfalbery/ggregplot')
if(require(INLAutils) == FALSE) devtools::install_github('timcdlucas/INLAutils'); library(INLAutils)

pacman::p_load(tidyverse, magrittr, janitor, lubridate, hms, skimr, forecast, # data analysis
               spdep, # to get the neighbors
               MCMCglmm, # required for ggregplot
               fontawesome, rsvg, # for fontawesome icons
               pander, knitr, DT, sparkline, # for nicely printed outputs
               scales, plotly, glue, # for plots
               sf, leaflet, tigris, tmap, # for maps
               gifski, av) # for creating gif and videos

3 A Spatial Model for Explaining the Variability in Growth Curve Parameters

3.1 Preparing the Data for INLA

In the code chunk below, we merge the vaccines and the growth_curve_models.rds data frames together and save them into an object titled inla_tbl. The merge operation is done in such a way that the number of rows is equal to the number of rows in the vaccines dataset, i.e., all 3,108 contiguous U.S. counties. This introduces NAs for the \(\phi_1, \, \phi_2, \, \text{ and } \phi_3\) parameters for the 327 counties for which we did not fit a logistic growth curve model. Furthermore, we engineer five new features:
- z, which is computed by dividing the phi1 estimate by 100, i.e., we convert the asymptote for the percent vaccinated by county to a fraction.
- z1, where we compute the logit of z. The rationale behind taking the logit of z is that we want to restrict the values of our asymptote to (0, 1), since the fraction of the population vaccinated cannot exceed 1, and we do not expect it to be exactly 0 or 1.
- z2, which is equal to phi2. We introduce it only to make our notation consistent.
- z3, which is equal to the log() of phi3. We want to restrict phi3 (the slope parameter) to positive values since we expect the percent of vaccinated individuals in a given county to increase over the course of our study period.
- idarea, a row index that identifies each county (areal unit) in the spatial random effect.
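The transformations above, and the inverses we use later to map fitted values back to the \(\phi\) scale, can be sketched as a quick round trip; the parameter values below are illustrative only:

```r
# illustrative round trip for the z1 (logit) and z3 (log) transformations
phi1 <- 75      # hypothetical asymptote, in percent
phi3 <- 0.8     # hypothetical slope parameter

z  <- phi1 / 100          # convert to a fraction in (0, 1)
z1 <- log(z / (1 - z))    # logit transform
z3 <- log(phi3)           # log transform

# the inverse transforms recover the original parameters
phi1_back <- 100 * exp(z1) / (1 + exp(z1))
phi3_back <- exp(z3)

all.equal(phi1_back, phi1)  # TRUE
all.equal(phi3_back, phi3)  # TRUE
```

This is the same back-transformation applied to the posterior means in Section 5.1.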

We reuse the urbnmapr package to obtain the sf geometries for each of the 3,108 counties. We then follow the approach in Chapter 5 of Paula Moraga's book to convert our sf geometries to a list of neighbours and a graph that can be read and processed by INLA.

# loading the phi parameters from step1_logistic_growth_curves
growth_curve_models = read_rds('data/growth_curve_models.rds') 

# loading the vaccines data which contains our three predictor variables 
vaccines = 
  read_rds('data/vaccines.rds') %>% 
  # selecting name-variables, population and the three predictors
  select( fips, recip_county, recip_state, svi_ctgy:perc_rep_votes) %>% 
  # removing duplicates since we no longer store the vaccination over time
  # our predictors do not change over the course of our dataset
  unique() %>% 
  mutate(county_name = paste(recip_county, recip_state, sep = ', ') ) %>% 
  select(-c( recip_county, recip_state)) %>% 
  relocate(county_name, .after = 'fips')

# we used right join since we wanted to have NAs for counties where we did not fit
# the logistic growth curve model
inla_tbl = right_join(
  growth_curve_models %>% select(-c(bic,mae,rmse,mse)),
  vaccines %>% select(-census2019),
  by = 'fips') %>% 
  ungroup() %>% 
  mutate( z = phi1/100,
          z1 = log(z/(1-z)),
          z2 = phi2,
          z3 = log(phi3),
          idarea = row_number()
          )

# getting the shape files
counties_sf = get_urbn_map(map = "counties", sf = TRUE) %>% 
  filter(!state_name %in% c('Alaska', 'Hawaii') )

# getting the neighbors list into inla
nb = poly2nb(counties_sf$geometry)
# based on https://www.paulamoraga.com/book-geospatial/sec-arealdatatheory.html
nb2INLA("data/map.adj", nb)
g = inla.read.graph(filename = "data/map.adj")
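As a minimal, self-contained illustration of the neighbour structure that nb2INLA() serializes, consider a toy 2 x 2 grid built with spdep::cell2nb() (rook contiguity by default); our actual nb object is constructed analogously from the county polygons:

```r
library(spdep)

# neighbours for a toy 2 x 2 regular grid (rook contiguity by default)
nb_toy <- cell2nb(nrow = 2, ncol = 2)

length(nb_toy)   # 4 areal units
card(nb_toy)     # number of neighbours per unit; each corner cell has 2
```

The nb object is simply a list of integer vectors of neighbour indices, which nb2INLA() writes to a file that inla.read.graph() can parse.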

3.2 Model 1

We capitalize on the popular Besag-York-Mollié (BYM) model and the INLA package to fit the logit model for phi1 (\(z_1 = \text{logit}(\phi_1)\)). We save the results from the model in an object res1, which can be downloaded from fmegahed/vaccines_spatial_and_optimization/data/. Furthermore, we print summaries of the fixed effects and the model hyperparameters below.

# INLA formula with default priors
# in the BYM model, u_i are the spatially structured random effects and v_i are the unstructured (iid) random effects
formula1 <- z1 ~ f(idarea, model = "bym", graph = g) + perc_rep_votes + svi_ctgy + metro_status

res1 <- inla(formula1,
  family = "gaussian", data = inla_tbl, control.predictor = list(compute = TRUE)
)

# saving the results from model1 into an rds file
write_rds(res1, file = 'data/inla_model1.rds')

# printing select results
cat('The fixed effects for the model are: \n')
round(res1$summary.fixed[, c(1:5,7)], digits = 3) %>% knitr::kable()

cat('A summary of the model hyperparameters: \n')
round(res1$summary.hyperpar[, 1:5], digits = 2) %>% knitr::kable()

cat('The fitted z1 = logit(phi1) values are: \n')
res1$summary.fitted.values %>% 
  mutate(county_names = inla_tbl$county_name) %>% 
  relocate(county_names, .before = mean) %>% 
  DT::datatable(options = list(pageLength = 10, autoWidth = TRUE), rownames = F) %>% 
  formatRound(2:6, digits = 2) 
The fixed effects for the model are: 

|                       |   mean|    sd| 0.025quant| 0.5quant| 0.975quant| kld|
|:----------------------|------:|-----:|----------:|--------:|----------:|---:|
|(Intercept)            |  1.437| 0.032|      1.374|    1.437|      1.501|   0|
|perc_rep_votes         | -0.020| 0.000|     -0.021|   -0.020|     -0.019|   0|
|svi_ctgyB              | -0.053| 0.019|     -0.091|   -0.053|     -0.015|   0|
|svi_ctgyC              | -0.096| 0.020|     -0.134|   -0.096|     -0.057|   0|
|svi_ctgyD              | -0.200| 0.020|     -0.239|   -0.200|     -0.160|   0|
|metro_statusNon-metro  | -0.050| 0.016|     -0.081|   -0.050|     -0.019|   0|

A summary of the model hyperparameters: 

|                                        |    mean|      sd| 0.025quant| 0.5quant| 0.975quant|
|:---------------------------------------|-------:|-------:|----------:|--------:|----------:|
|Precision for the Gaussian observations |   13.96|    0.80|      11.66|    13.34|      16.48|
|Precision for idarea (iid component)    |   15.76|    1.06|      12.90|    16.55|      19.38|
|Precision for idarea (spatial component)| 2718.71| 2100.64|     431.09|  2170.09|    8228.12|

The fitted z1 = logit(phi1) values are: 

3.3 Model 2

Similar to Model 1, we capitalize on the popular Besag-York-Mollié (BYM) model and the INLA package to fit a linear model for phi2 (\(z_2 = \phi_2\)). We save the results from the model in an object res2, which can be downloaded from fmegahed/vaccines_spatial_and_optimization/data/. Furthermore, we print summaries of the fixed effects and the model hyperparameters below.

# INLA formula with default priors
# in the BYM model, u_i are the spatially structured random effects and v_i are the unstructured (iid) random effects
formula2 <- z2 ~ f(idarea, model = "bym", graph = g) + perc_rep_votes + svi_ctgy + metro_status

res2 <- inla(formula2,
  family = "gaussian", data = inla_tbl, control.predictor = list(compute = TRUE)
)

write_rds(res2, file = 'data/inla_model2.rds')

# printing select results
cat('The fixed effects for the model are: \n')
round(res2$summary.fixed[, c(1:5,7)], digits = 3) %>% knitr::kable()

cat('A summary of the model hyperparameters: \n')
round(res2$summary.hyperpar[, 1:5], digits = 2) %>% knitr::kable()

cat('The fitted z2 = phi2 values are: \n')
res2$summary.fitted.values %>% 
  mutate(county_names = inla_tbl$county_name) %>% 
  relocate(county_names, .before = mean) %>% 
  DT::datatable(options = list(pageLength = 10, autoWidth = TRUE), rownames = F) %>% 
  formatRound(2:6, digits = 2) 
The fixed effects for the model are: 

|                       |   mean|    sd| 0.025quant| 0.5quant| 0.975quant| kld|
|:----------------------|------:|-----:|----------:|--------:|----------:|---:|
|(Intercept)            | -7.445| 0.566|     -8.554|   -7.445|     -6.335|   0|
|perc_rep_votes         |  0.041| 0.008|      0.024|    0.041|      0.057|   0|
|svi_ctgyB              | -0.413| 0.340|     -1.080|   -0.413|      0.254|   0|
|svi_ctgyC              |  0.623| 0.345|     -0.054|    0.623|      1.300|   0|
|svi_ctgyD              |  1.167| 0.355|      0.472|    1.167|      1.863|   0|
|metro_statusNon-metro  |  0.385| 0.276|     -0.155|    0.385|      0.926|   0|

A summary of the model hyperparameters: 

|                                        |    mean|      sd| 0.025quant| 0.5quant| 0.975quant|
|:---------------------------------------|-------:|-------:|----------:|--------:|----------:|
|Precision for the Gaussian observations |    0.03|    0.00|       0.03|     0.03|       0.03|
|Precision for idarea (iid component)    |    0.14|    0.04|       0.09|     0.14|       0.24|
|Precision for idarea (spatial component)| 2453.13| 2905.38|     189.66|  1574.64|   10025.11|

The fitted z2 = phi2 values are: 

3.4 Model 3

Similar to Models 1 and 2, we capitalize on the popular Besag-York-Mollié (BYM) model and the INLA package to fit a model for the log of phi3 (\(z_3 = \log(\phi_3)\)). We save the results from the model in an object res3, which can be downloaded from fmegahed/vaccines_spatial_and_optimization/data/. Furthermore, we print summaries of the fixed effects and the model hyperparameters below.

# INLA formula with default priors
# in the BYM model, u_i are the spatially structured random effects and v_i are the unstructured (iid) random effects
formula3 <- z3 ~ f(idarea, model = "bym", graph = g) + perc_rep_votes + svi_ctgy + metro_status

res3 <- inla(formula3,
  family = "gaussian", data = inla_tbl, control.predictor = list(compute = TRUE)
)

write_rds(res3, file = 'data/inla_model3.rds')

# printing select results
cat('The fixed effects for the model are: \n')
round(res3$summary.fixed[, c(1:5,7)], digits = 3) %>% knitr::kable()

cat('A summary of the model hyperparameters: \n')
round(res3$summary.hyperpar[, 1:5], digits = 2) %>% knitr::kable()

cat('The fitted z3 = log(phi3) values are: \n')
res3$summary.fitted.values %>% 
  mutate(county_names = inla_tbl$county_name) %>% 
  relocate(county_names, .before = mean) %>% 
  DT::datatable(options = list(pageLength = 10, autoWidth = TRUE), rownames = F) %>% 
  formatRound(2:6, digits = 2) 
The fixed effects for the model are: 

|                       |   mean|    sd| 0.025quant| 0.5quant| 0.975quant| kld|
|:----------------------|------:|-----:|----------:|--------:|----------:|---:|
|(Intercept)            |  0.053| 0.034|     -0.013|    0.053|      0.119|   0|
|perc_rep_votes         | -0.006| 0.001|     -0.007|   -0.006|     -0.005|   0|
|svi_ctgyB              | -0.088| 0.020|     -0.128|   -0.088|     -0.048|   0|
|svi_ctgyC              | -0.225| 0.021|     -0.265|   -0.225|     -0.184|   0|
|svi_ctgyD              | -0.436| 0.021|     -0.478|   -0.436|     -0.395|   0|
|metro_statusNon-metro  |  0.084| 0.016|      0.052|    0.084|      0.116|   0|

A summary of the model hyperparameters: 

|                                        |    mean|      sd| 0.025quant| 0.5quant| 0.975quant|
|:---------------------------------------|-------:|-------:|----------:|--------:|----------:|
|Precision for the Gaussian observations |   12.69|    0.72|      10.60|    12.09|      14.96|
|Precision for idarea (iid component)    |   14.48|    0.98|      11.85|    15.27|      17.84|
|Precision for idarea (spatial component)| 2665.30| 2116.53|     387.91|  2105.50|    8219.38|

The fitted z3 = log(phi3) values are: 

4 Visualizing the INLA Model Outputs

4.1 Effect Sizes for all three Models

Given that we used the same covariates across all three models, we visualize the effect sizes across the models below. Our code capitalizes on the developmental ggregplot package, and specifically the Efxplot() function, which allows us to create a ggplot2 object containing the effect sizes and their 95% credible intervals.

# capitalizing on the Efxplot from the ggregplot package
ggregplot::Efxplot(list(res1, res2, res3), PointSize = 2, BarWidth = 0.5, 
    tips = 0.25) + 
  scale_color_brewer(palette = 'Dark2') + 
  labs(title = 'Effect Sizes and their Credible Intervals') + 
  theme(legend.position = 'bottom')

4.2 Exploring Each Model in Detail

To provide additional insight, we present the following plots for each model:

The autoplot() function provides a visual overview of: (top-left) the marginal posterior distributions and credible intervals of the intercept and the covariate coefficients; (top-right) the marginal posterior distributions and credible intervals of the model hyperparameters; (bottom-left) the marginal posterior means and 95% credible intervals of any random effects specified in the model; and (bottom-right) the linear predictor and the fitted values.

To facilitate the visualization of this data, we also present a separate plot of each of those visuals in a dedicated subsection.

  • A spatial plot of the posterior means for each of the \(z_1, z_2, \text{ and } z_3\) parameters is also presented. We only show the interactive leaflet version of these plots in this document; however, we also generate and save a static version (with the chunk option include = FALSE). The static versions can be seen at fmegahed/vaccines_spatial_optimization.

4.2.1 Model 1

p = autoplot(res1)

4.2.1.1 Posterior Distribution of the Intercept and Covariates

p[[1]] + 
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2')

4.2.1.2 Precision

p[[2]] + 
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2')

4.2.1.3 Marginal Posterior Mean and 95% CI of the Spatial Random Effects

p[[3]] + 
  ylim(-1,1) +
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2') +
  scale_x_continuous(breaks = pretty_breaks(n=10))

4.2.1.4 Marginal posterior mean and 95% credible intervals of the linear predictor, and fitted values

p[[4]] + 
  ylim(-1,1) +
  scale_x_continuous(breaks = pretty_breaks(n=10))

4.2.1.5 Spatial Plot of the Posterior Means

# creating the interactive plot
leaflet_sf = 
  # getting the shape file for the continental US with a longlat projection
  # (albersusa:: is used explicitly since counties_sf was reassigned to a data frame above)
  albersusa::counties_sf(proj = 'longlat') %>% filter(!state %in% c('Alaska', 'Hawaii')) %>% 
  # adding the covariates
  right_join(vaccines %>% dplyr::select(-county_name), by = 'fips')

# adding the results from model1
leaflet_sf$model1mean = res1$summary.fitted.values[, "mean"]

# selecting the color palette
leaflet_pal =  colorNumeric('YlOrRd', domain = leaflet_sf$model1mean,
                            na.color = "white")

leaflet() %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 4) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(
    data = leaflet_sf, stroke = FALSE, fillColor = ~leaflet_pal(leaflet_sf$model1mean), # adding the data
    weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
    popup = paste(
      "County:",leaflet_sf$name, '<br>',
      "SVI Cat:",leaflet_sf$svi_ctgy, '<br>',
      "Metro Status:",leaflet_sf$metro_status, '<br>',
      '% Rep Votes:', round(leaflet_sf$perc_rep_votes, 1), '<br>',
      "z1 Posterior Mean:", scales::comma(round(leaflet_sf$model1mean, 1)), '<br>')
  ) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leaflet_pal, values = leaflet_sf$model1mean,
            title = "z1 Posterior Mean", opacity = 1) # legend formatting

4.2.2 Model 2

p = autoplot(res2)

4.2.2.1 Posterior Distribution of the Intercept and Covariates

p[[1]] + 
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2')

4.2.2.2 Precision

p[[2]] + 
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2')

4.2.2.3 Marginal Posterior Mean and 95% CI of the Spatial Random Effects

p[[3]] + 
  ylim(-1,1) +
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2') +
  scale_x_continuous(breaks = pretty_breaks(n=10))

4.2.2.4 Marginal posterior mean and 95% credible intervals of the linear predictor, and fitted values

p[[4]] + 
  ylim(-10,10) +
  scale_x_continuous(breaks = pretty_breaks(n=10))

4.2.2.5 Spatial Plot of the Posterior Means

# adding the results from model2
leaflet_sf$model2mean = res2$summary.fitted.values[, "mean"]

# selecting the color palette
leaflet_pal =  colorNumeric('YlOrRd', domain = leaflet_sf$model2mean,
                            na.color = "white")

leaflet() %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 4) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(
    data = leaflet_sf, stroke = FALSE, fillColor = ~leaflet_pal(leaflet_sf$model2mean), # adding the data
    weight = 2, opacity = 1, color = "gray", dashArray = "3", fillOpacity = 0.7, # adding color specs
    popup = paste(
      "County:",leaflet_sf$name, '<br>',
      "SVI Cat:",leaflet_sf$svi_ctgy, '<br>',
      "Metro Status:",leaflet_sf$metro_status, '<br>',
      '% Rep Votes:', round(leaflet_sf$perc_rep_votes, 1), '<br>',
      "z2 Posterior Mean:", scales::comma(round(leaflet_sf$model2mean, 1)), '<br>')
  ) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leaflet_pal, values = leaflet_sf$model2mean,
            title = "z2 Posterior Mean", opacity = 1) # legend formatting

4.2.3 Model 3

p = autoplot(res3)

4.2.3.1 Posterior Distribution of the Intercept and Covariates

p[[1]] + 
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2')

4.2.3.2 Precision

p[[2]] + 
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2')

4.2.3.3 Marginal Posterior Mean and 95% CI of the Spatial Random Effects

p[[3]] + 
  ylim(-1,1) +
  geom_line(aes(color = var), size = 1.3) +
  scale_color_brewer(palette = 'Dark2') +
  scale_x_continuous(breaks = pretty_breaks(n=10))

4.2.3.4 Marginal posterior mean and 95% credible intervals of the linear predictor, and fitted values

p[[4]] + 
  ylim(-5,5) +
  scale_x_continuous(breaks = pretty_breaks(n=10))

4.2.3.5 Spatial Plot of the Posterior Means

# adding the results from model3
leaflet_sf$model3mean = res3$summary.fitted.values[, "mean"]

# selecting the color palette
leaflet_pal =  colorNumeric('YlOrRd', domain = leaflet_sf$model3mean,
                            na.color = "white")

leaflet() %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 4) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(
    data = leaflet_sf, stroke = FALSE, fillColor = ~leaflet_pal(leaflet_sf$model3mean), # adding the data
    weight = 2, opacity = 1, color = "gray", dashArray = "3", fillOpacity = 0.7, # adding color specs
    popup = paste(
      "County:",leaflet_sf$name, '<br>',
      "SVI Cat:",leaflet_sf$svi_ctgy, '<br>',
      "Metro Status:",leaflet_sf$metro_status, '<br>',
      '% Rep Votes:', round(leaflet_sf$perc_rep_votes, 1), '<br>',
      "z3 Posterior Mean:", scales::comma(round(leaflet_sf$model3mean, 1)), '<br>')
  ) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leaflet_pal, values = leaflet_sf$model3mean,
            title = "z3 Posterior Mean", opacity = 1) # legend formatting

5 Evaluating the Models’ Predictive Performance

5.1 The \(\mathbf{\Phi}\) Matrix

# a vector of estimated phi1 parameters
z1_post_mean = res1$summary.fitted.values[, 'mean'] 
phi1_hat_vec = exp(z1_post_mean)/( 1 + exp(z1_post_mean) )
phi1_hat_vec = phi1_hat_vec*100

# a vector of estimated phi2 parameters
phi2_hat_vec = res2$summary.fitted.values[, 'mean'] 

# a vector of estimated phi3 parameters
z3_post_mean = res3$summary.fitted.values[, 'mean'] 
phi3_hat_vec = exp(z3_post_mean)

Phi_hat_tbl = tibble(fips = inla_tbl$fips,
                     county_name = inla_tbl$county_name,
                     phi1 = inla_tbl$phi1,
                     phi1_hat = phi1_hat_vec,
                     phi2 = inla_tbl$phi2,
                     phi2_hat = phi2_hat_vec,
                     phi3 = inla_tbl$phi3,
                     phi3_hat = phi3_hat_vec)

Phi_hat_tbl %>%
  datatable() %>% 
  formatRound(columns = c('phi1', 'phi1_hat', 'phi2',
                          'phi2_hat', 'phi3', 'phi3_hat'))
write_rds(x = Phi_hat_tbl, file = 'data/phi_hat_tbl.rds')

5.2 Differences between INLA Estimated Phis and those from Logistic Growth Curve Models

phi_spatial_vs_logistic_growth = 
  rbind( 
    forecast::accuracy(object = Phi_hat_tbl$phi1_hat, x = Phi_hat_tbl$phi1),
    forecast::accuracy(object = Phi_hat_tbl$phi2_hat, x = Phi_hat_tbl$phi2),
    forecast::accuracy(object = Phi_hat_tbl$phi3_hat, x = Phi_hat_tbl$phi3)
  )

row.names(phi_spatial_vs_logistic_growth) = c('phi1', 'phi2', 'phi3')

round(phi_spatial_vs_logistic_growth, digits = 2) %>% 
  knitr::kable()
|     |    ME| RMSE|  MAE|    MPE|  MAPE|
|:----|-----:|----:|----:|------:|-----:|
|phi1 | -0.08| 4.22| 3.17|  -1.64|  6.78|
|phi2 |  0.00| 5.38| 1.55| -21.19| 27.91|
|phi3 |  0.06| 0.48| 0.13|  -1.85| 12.56|

5.3 Impact on Predictive Accuracy for the 2,781 Counties

5.3.1 MAE By County

# reloading the vaccines data to include the dates
vaccines_w_dates = 
  read_rds('data/vaccines.rds') %>% 
  mutate(county_name = paste(recip_county, recip_state, sep = ', ') ) %>% 
  dplyr::select(-c( recip_county, recip_state)) %>% 
  relocate(county_name, .after = 'fips')

# Computing the expected number of observations (i.e., months of data) per county
months_from_start = 
  seq.Date(from = min(vaccines_w_dates$date), 
           to = max(vaccines_w_dates$date),
           by = 'month') %>% 
  length()

# creating a vector of months from start, which will be our independent variable representing time
months_from_start = 1:months_from_start


# custom function for logistic growth curve
lgc_fit = function(p1, p2, p3, time = months_from_start){
  p1/(1+exp(-(p2+p3*time)))
}
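As a quick, self-contained sanity check of the logistic growth curve (the function is restated below so the snippet runs on its own), the fitted values should be near zero early on when \(\phi_2\) is a large negative number, increase monotonically when \(\phi_3 > 0\), and approach the asymptote \(\phi_1\) as time grows; the parameter values are illustrative only:

```r
# restating the logistic growth curve function for a self-contained check
lgc_fit = function(p1, p2, p3, time){
  p1/(1 + exp(-(p2 + p3*time)))
}

# illustrative parameters: asymptote of 60%, intercept -4, slope 0.5
fitted_vals = lgc_fit(p1 = 60, p2 = -4, p3 = 0.5, time = 1:36)

fitted_vals[1]   # small (about 1.8), i.e., few vaccinated in month 1
fitted_vals[36]  # essentially at the asymptote of 60 by month 36
```

This is the behavior we rely on when comparing the spatially smoothed curves to the county-level logistic growth curves below.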

# computing the predictive accuracy with respect to two baselines
## actuals (i.e., CDC reported data) and 'lgc_fitted'
pred_acc_per_county = 
  # actuals
  vaccines_w_dates %>% dplyr::select(fips, county_name, series_complete_pop_pct) %>% 
  right_join(
    Phi_hat_tbl %>% drop_na() %>% dplyr::select(-county_name),
    by = 'fips') %>% 
  group_by(fips) %>% 
  nest(actuals = series_complete_pop_pct) %>%  
  mutate(
    actuals = map(.x = actuals, .f = pull),
    lgc_fitted = pmap(list(phi1, phi2, phi3), .f = lgc_fit),
    spatial_fitted = pmap(list(phi1_hat, phi2_hat, phi3_hat), .f = lgc_fit),
    metrics_actuals = map2(.x = spatial_fitted, .y = actuals, .f = forecast::accuracy),
    metrics_lgc = map2(.x = spatial_fitted, .y = lgc_fitted, .f = forecast::accuracy),
    mae_actuals = map_dbl(.x = metrics_actuals, .f = extract2, c(3)),
    mae_lgc = map_dbl(.x = metrics_lgc, .f = extract2, c(3))
    )


# saving the data
write_rds(pred_acc_per_county, file = 'data/pred_acc_per_county.rds')


# printing the table

### custom_function for spk_line
spk_line_exp_growth = function(phi1, phi2, phi3, time = months_from_start){
  lgc_fitted = phi1/(1+exp(-(phi2+phi3*time))) # fitted values 
  # adding spark lines based on the sparkline package
  spk_chr(lgc_fitted, type = 'line')
}

### sparkline to be added to the table
spark_lines = Phi_hat_tbl %>% 
  group_by(fips) %>% 
  transmute(
    lgc_curve = pmap_chr(.l = list(phi1, phi2, phi3), .f = spk_line_exp_growth),
    spatial_curve = pmap_chr(.l = list(phi1_hat, phi2_hat, phi3_hat), .f = spk_line_exp_growth)
  )

# Printing the actual table
pred_acc_per_county %>% 
  ungroup() %>% 
  dplyr::select(fips, county_name, mae_actuals, mae_lgc) %>% 
  left_join(y = spark_lines %>% ungroup(), by = 'fips') %>% 
  dplyr::select(-fips) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, 
                options = list(pageLength = 10, 
                               # adding the htmlwidget to the code
                               fnDrawCallback = htmlwidgets::JS(
                                 '
function(){
  HTMLWidgets.staticRender();
}
'))) %>% 
  spk_add_deps()   %>%
  DT::formatRound(columns = c('mae_actuals', 'mae_lgc'), digits = 2)

Based on the Bayesian models, the average MAEs are 5.34 and 5.14 when compared to the CDC reported data and the smoothed/fitted logistic growth curves, respectively.

5.3.2 MAE By State

5.3.2.1 Table

pred_acc_per_county %>% 
  ungroup() %>% 
  mutate(state = str_sub(county_name, start = -2)) %>% 
  group_by(state) %>% 
  summarise(q1_mae_actuals = quantile(mae_actuals, probs =  0.25),
            median_mae_actuals = median(mae_actuals),
            avg_mae_actuals = mean(mae_actuals),
            q3_mae_actuals = quantile(mae_actuals, probs =  0.75),
            q1_mae_lgc = quantile(mae_lgc, probs =  0.25),
            median_mae_lgc = median(mae_lgc),
            avg_mae_lgc = mean(mae_lgc),
            q3_mae_lgc = quantile(mae_lgc, probs =  0.75)) -> pred_acc_per_state

# saving the data
write_rds(pred_acc_per_state, file = 'data/pred_acc_per_state.rds')

pred_acc_per_state %>% 
  mutate(across(q1_mae_actuals:q3_mae_lgc, ~ round(.x, digits = 2))) %>% 
  DT::datatable(rownames = FALSE, escape = FALSE, 
                options = list(pageLength = 10))

5.3.2.2 Figure

# creating a static visual
# obtaining the state geometries from urbnmapr (excluding Alaska and Hawaii)
states_sf = get_urbn_map(map = "states", sf = TRUE) %>% 
  filter(!state_name %in% c('Alaska', 'Hawaii'))

states_sf = left_join(
  x = states_sf, 
  y = pred_acc_per_state %>% dplyr::select(state, median_mae_actuals ),
  by = c('state_abbv' = 'state'))

tm_shape(states_sf) + 
  tm_borders(col = "black", lwd = 0.5) +
  tm_polygons('median_mae_actuals', title = 'Median MAE', palette = "YlOrRd")

5.3.2.3 Correlation with the State Unknowns \(\mathbf{Y}^*\)

Based on the state map in the previous subsection, we hypothesize that some of the large errors can possibly be explained by a larger percentage of vaccinated individuals who were not assigned to a county within a given state. In the code chunk below, we perform the required data processing/transformations to examine this correlation within each state.

# count of unknowns per state
y_star = read_rds('data/y_star_final.rds') %>% 
  rename(unknown_counts = series_complete_yes)

# computing total count of vaccinations per state
cdc_state_vacc = 
  read_csv('https://data.cdc.gov/api/views/unsk-b7fc/rows.csv?accessType=DOWNLOAD') %>% 
  # converting names to lower case
  clean_names() %>% 
  # selecting columns of interest
  dplyr::select(date:location, 
         series_complete_yes) %>% 
  rename(
    # renaming the location column to recip_state to match our earlier data
    recip_state = location,
    # renaming series_complete_yes to a more descriptive name
    total_vacc_per_state = series_complete_yes) %>% 
  # converting the date column to date and creating the day_of_month variable
  mutate(date = mdy(date), day_of_month = day(date)) %>% 
  # filtering the data to the continental U.S. and first day of the month
  filter(!recip_state %in% c('AK', 'AS', 'FM', 'GU', 'HI', 
                             'MH', 'MP', 'PR', 'PW', 'UNK', 'VI',
                             'BP2', 'DD2', 'IH2', 'LTC', 'RP', 'US', 'VA2') &
        day_of_month == 1) %>% 
  # arranging the data in ascending order by date and state
  arrange(date, recip_state) %>% 
  dplyr::select(recip_state, total_vacc_per_state) %>% 
  nest(total_vacc_per_state = total_vacc_per_state)


# the required data for the comparison

## computing the percent unknowns from the total vaccinated in the state
state_comparison_tbl = 
  left_join(x = cdc_state_vacc,
            y = y_star %>% ungroup() %>% dplyr::select(recip_state, unknown_counts),
            by = 'recip_state') %>% 
  mutate(percent_unknown = 
           map2(
             .x = unknown_counts, .y= total_vacc_per_state,
             .f = function(x,y){100*x/(y)} )
  )

## the median absolute error per state and time
# note that it is computed on perc vaccinated (not accounting for unknowns)
pred_acc_median_state = 
  # reading the pred per county, selecting columns of interest and ungroup
  read_rds('data/pred_acc_per_county.rds') %>%  
  dplyr::select( c(fips, county_name, actuals, spatial_fitted)) %>% 
  ungroup() %>% 
  # extracting state name and computing the absolute error
  mutate(
    recip_state = str_sub(county_name, start = -2),
    abs_error = map2(.x = actuals, .y= spatial_fitted, 
                      .f = function(x,y){ abs(x - y) } )
  ) %>% 
  # unnesting the abs_error
  unnest(abs_error) %>% 
  # grouping by recip_state and county_name so we can count the number of rows
  # in each county
  group_by(recip_state, county_name) %>% 
  # counting the months in each county
  mutate(months_from_start = row_number() ) %>% 
  # ungroup and group by both the state and time
  ungroup() %>% group_by(recip_state, months_from_start) %>% 
  # a summary of the absolute error over time per state
  summarise(median_abs_error = round(median(abs_error), digits = 3) )

# printing the obtained median absolute error for a sanity check
pred_acc_median_state %>% DT::datatable()
# left joining with the state_comparison table after nesting the output
cor_tbl = 
  # left joining with the original state_comparison_tbl
  # with a nested version of the pred_acc_median_state so they are similar shape/size
  left_join(
    x = state_comparison_tbl %>% dplyr::select(c(recip_state, percent_unknown)) %>% ungroup(),
    y= pred_acc_median_state %>% ungroup() %>% 
      dplyr::select( c(recip_state, median_abs_error ) ) %>% 
      group_by(recip_state) %>% 
      nest(median_abs_error = median_abs_error),
    by = 'recip_state') %>% 
  mutate(
    # removing INF values from not having state totals prior to April 2021
    percent_unknown = map(.x = percent_unknown, .f = function(x){
      x = as.numeric( unlist(x) )
      x = x[is.finite(x)]
      return(x)
    }),
    # making the length of the median_abs_error consistent with the perc_unknown
    median_abs_error = map2(.x = median_abs_error, .y = percent_unknown,
                            .f = function(x,y){
                              x = as.numeric(unlist(x))
                              y = as.numeric(unlist(y))
                              len_x = length(x)
                              len_y = length(y)
                              diff_len = len_x - len_y
                              if(diff_len > 0){
                                x = x[-c(1:diff_len)]
                              }
                              return(x)
                            }) ) %>% 
  # removing TX since we had no baseline for most of its data
  filter(recip_state != 'TX') %>% 
  # computing the correlation
  mutate(
    max_perc_unknown = map_dbl(.x = percent_unknown, .f = max) %>% round(digits = 2),
    corr = map2_dbl(.x = percent_unknown, .y = median_abs_error, .f = cor) %>% 
      round(digits = 3)
  )

cor_tbl %>% 
  dplyr::select(c(recip_state, max_perc_unknown, corr)) %>% 
  datatable()
write_rds(x = pred_acc_median_state, file = 'data/pred_acc_median_state.rds')
write_rds(x = cor_tbl, file = 'data/cor_tbl.rds')

5.4 The Logistic Growth Curves for our Sample FIPS

Here, we reproduce the graphs constructed in Sec. 4 of step1_logistic_growth.html, with the addition of an extra line for the curves obtained from our spatial model. One difference is that color is now used to distinguish the curves fitted by the logistic growth model and the spatial model, rather than to encode the svi_ctgy.

sample_fips = read_rds(file = 'data/sample_fips.rds')
vaccines = read_rds(file = 'data/vaccines.rds')

# sampling the original data
vaccines %>% 
  filter(fips %in% sample_fips) %>% 
  mutate(county_name = paste(recip_county, recip_state, sep = ', ') ) %>% 
  group_by(county_name, fips) %>% 
  dplyr::select(
    -c(date, mmwr_week, recip_county, recip_state, completeness_pct, census2019) ) %>% 
  mutate(series_complete_pop_pct = list(series_complete_pop_pct)) %>% 
  unique() -> 
  vaccines_sampled 


# constructing the viz_tbl
## without imputed counties
viz_tbl = read_rds('data/pred_acc_per_county.rds') %>% 
  ungroup() %>% 
  filter(fips %in% sample_fips) %>% 
  dplyr::select(fips, county_name, actuals, lgc_fitted, spatial_fitted)

## a tbl for imputed counties
imputed_counties = 
  Phi_hat_tbl %>% filter(fips %in% sample_fips & !fips %in% viz_tbl$fips)

imputed_tbl = 
 read_rds("data/vaccines.rds") %>%
  dplyr::select(fips,svi_ctgy, metro_status, perc_rep_votes, series_complete_pop_pct) %>% 
  right_join(
    imputed_counties,
    by = 'fips') %>% 
  ungroup() %>% 
  group_by(fips) %>% 
  nest(data = series_complete_pop_pct) %>%  
  mutate(
    actuals = map(.x = data, .f = function(x) x$series_complete_pop_pct),
    lgc_fitted = pmap(list(phi1, phi2, phi3), .f = lgc_fit),
    spatial_fitted = pmap(list(phi1_hat, phi2_hat, phi3_hat), .f = lgc_fit)
    ) %>% 
  dplyr::select(colnames(viz_tbl))

# combining the not imputed and imputed counties together
viz_tbl = bind_rows(viz_tbl, imputed_tbl)

# parameters for plotting
colors_brewer = RColorBrewer::brewer.pal(n = 8, name = 'Paired')
cols = c("lgc_fitted" = colors_brewer[1], 'spatial_fitted' = colors_brewer[2])
shapes = c("Metro" = 19, 'Non-metro' = 17)
lines = c('lgc_fitted' = 'longdash', 'spatial_fitted' = 'dotdash')

# constructing a tibble which contains all plots
combined_plots = 
  # joining the data
  left_join(x = vaccines_sampled %>%
              ungroup() %>% dplyr::select(-c(county_name, series_complete_pop_pct)) ,
            y = viz_tbl %>% ungroup(), by = 'fips') %>%
  group_by(fips) %>% 
  mutate(months_from_start = list(months_from_start)) %>% 
  ungroup() %>%
  # making the data longer
  unnest(cols = c(months_from_start, actuals, lgc_fitted, spatial_fitted)) %>% 
  # grouping and computing the months from start for our dataset
  # it will repeat for each fips due to grouping
  group_by(fips) %>% 
  # creating a dates column and converting the MAE column to character
  mutate(dates_from_start = seq.Date(from = min(vaccines$date),
                                     to = max(vaccines$date),
                                     by = 'months'),
         svi_ctgy = as.character(svi_ctgy),
         metro_status = as.character(metro_status)) %>% 
  ungroup() %>% 
  # so it follows the same order as the TS plots in 4.2.2. in 
  # https://fmegahed.github.io/research/covid_vaccines/step0_extract_and_eda.html#time-series-plots
  group_by(svi_ctgy, metro_status) %>% 
  arrange(svi_ctgy, metro_status) %>% 
  ungroup() %>% 
  # creating a nested list with variables to be used in the plot
  nest(
    data = c(perc_rep_votes, svi_ctgy, metro_status,
             months_from_start, dates_from_start,
             actuals, lgc_fitted, spatial_fitted) 
  ) %>%
  # storing each plot in a variable titled 'plot'
  mutate(
    # creating the plot
    plot = map2(
      .x= data, .y = county_name, 
      .f = function(a, b){
        df = a %>% pivot_longer( cols = c(lgc_fitted, spatial_fitted) )
        
        ggplot(data = df, aes(x = dates_from_start, color = name, shape = metro_status,
                              linetype = name)) +
          geom_point(aes(y = actuals),  color = 'black', size = 2)+
          geom_line(aes(y = value), size=1) +
          scale_color_manual(values = cols) +
          scale_shape_manual(values = shapes) +
          labs(x = 'Months from Start', y = '% Vaccinated', 
               title = paste(b) )  +
          scale_x_date(breaks = scales::pretty_breaks(n= 6)) +
          scale_y_continuous(breaks = scales::pretty_breaks(n= 5), limits = c(0,100)) +
          theme_bw(base_size = 10) +
          theme(legend.position = 'bottom', 
                legend.title=element_text(size=9),
                legend.text=element_text(size=8)) 
      }
    )
  )

# naming the plots and arranging them in a grid
names(combined_plots$plot) = combined_plots$county_name

p_combined = ggpubr::ggarrange(plotlist = combined_plots$plot, 
                               ncol = 2,
                               nrow = 4,
                               common.legend = TRUE)

p_combined


6 Appendix

In this appendix, we print all the packages used in our analysis and their versions to assist with reproducing our results/analysis.

pander(sessionInfo(), compact = TRUE) # printing the session info

R version 4.2.1 (2022-06-23 ucrt)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=English_United States.utf8, LC_CTYPE=English_United States.utf8, LC_MONETARY=English_United States.utf8, LC_NUMERIC=C and LC_TIME=English_United States.utf8

attached base packages: parallel, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: MASS(v.7.3-58.1), av(v.0.8.1), gifski(v.1.6.6-1), tmap(v.3.3-3), tigris(v.1.6.1), leaflet(v.2.1.1), glue(v.1.6.2), plotly(v.4.10.0), scales(v.1.2.1), sparkline(v.2.0), DT(v.0.24), knitr(v.1.40), pander(v.0.6.5), rsvg(v.2.3.1), fontawesome(v.0.3.0), MCMCglmm(v.2.34), ape(v.5.6-2), coda(v.0.19-4), spdep(v.1.2-5), sf(v.1.0-8), spData(v.2.2.0), forecast(v.8.17.0), skimr(v.2.1.4), hms(v.1.1.2), lubridate(v.1.8.0), janitor(v.2.1.0), magrittr(v.2.0.3), forcats(v.0.5.2), stringr(v.1.4.1), dplyr(v.1.0.9), purrr(v.0.3.4), readr(v.2.1.2), tidyr(v.1.2.0), tibble(v.3.1.8), tidyverse(v.1.3.2), INLAutils(v.0.0.6), INLA(v.22.05.07), sp(v.1.5-0), foreach(v.1.5.2), Matrix(v.1.4-1), albersusa(v.0.4.1), urbnmapr(v.0.0.0.9002), devtools(v.2.4.4), usethis(v.2.1.6), pacman(v.0.5.1), RColorBrewer(v.1.1-3) and ggplot2(v.3.3.6)

loaded via a namespace (and not attached): utf8(v.1.2.2), tidyselect(v.1.1.2), htmlwidgets(v.1.5.4), grid(v.4.2.1), maptools(v.1.1-4), munsell(v.0.5.0), codetools(v.0.2-18), units(v.0.8-0), miniUI(v.0.1.1.1), withr(v.2.5.0), colorspace(v.2.0-3), highr(v.0.9), uuid(v.1.1-0), rstudioapi(v.0.14), wk(v.0.6.0), ggsignif(v.0.6.3), TTR(v.0.24.3), labeling(v.0.4.2), repr(v.1.1.4), bit64(v.4.0.5), farver(v.2.1.1), vctrs(v.0.4.1), generics(v.0.1.3), xfun(v.0.32), R6(v.2.5.1), cachem(v.1.0.6), assertthat(v.0.2.1), vroom(v.1.5.7), promises(v.1.2.0.1), nnet(v.7.3-17), googlesheets4(v.1.0.1), rgeos(v.0.5-9), gtable(v.0.3.0), lwgeom(v.0.2-8), processx(v.3.7.0), MatrixModels(v.0.5-0), timeDate(v.4021.104), rlang(v.1.0.4), splines(v.4.2.1), rstatix(v.0.7.0), rgdal(v.1.5-32), lazyeval(v.0.2.2), gargle(v.1.2.0), dichromat(v.2.0-0.1), broom(v.1.0.1), s2(v.1.1.0), yaml(v.2.3.5), reshape2(v.1.4.4), abind(v.1.4-5), modelr(v.0.1.9), crosstalk(v.1.2.0), backports(v.1.4.1), httpuv(v.1.6.5), quantmod(v.0.4.20), tensorA(v.0.36.2), tools(v.4.2.1), cubature(v.2.0.4.5), ellipsis(v.0.3.2), jquerylib(v.0.1.4), raster(v.3.5-29), proxy(v.0.4-27), sessioninfo(v.1.2.2), Rcpp(v.1.0.9), plyr(v.1.8.7), base64enc(v.0.1-3), classInt(v.0.4-7), ps(v.1.7.1), prettyunits(v.1.1.1), ggpubr(v.0.4.0), deldir(v.1.0-6), fracdiff(v.1.5-1), cowplot(v.1.1.1), urlchecker(v.1.0.1), tmaptools(v.3.1-1), zoo(v.1.8-10), haven(v.2.5.1), leafem(v.0.2.0), fs(v.1.5.2), data.table(v.1.14.2), lmtest(v.0.9-40), reprex(v.2.0.2), googledrive(v.2.0.0), pkgload(v.1.3.0), mime(v.0.12), evaluate(v.0.16), xtable(v.1.8-4), XML(v.3.99-0.10), readxl(v.1.4.1), gridExtra(v.2.3), compiler(v.4.2.1), KernSmooth(v.2.23-20), crayon(v.1.5.1), htmltools(v.0.5.3), corpcor(v.1.6.10), later(v.1.3.0), tzdb(v.0.3.0), DBI(v.1.1.3), dbplyr(v.2.2.1), rappdirs(v.0.3.3), boot(v.1.3-28), car(v.3.1-0), cli(v.3.3.0), quadprog(v.1.5-8), pkgconfig(v.2.0.3), ggregplot(v.0.1.0), foreign(v.0.8-82), terra(v.1.6-7), xml2(v.1.3.3), bslib(v.0.4.0), rvest(v.1.0.3), 
snakecase(v.0.11.0), callr(v.3.7.2), digest(v.0.6.29), rmarkdown(v.2.16), cellranger(v.1.1.0), leafsync(v.0.1.0), curl(v.4.3.2), shiny(v.1.7.2), urca(v.1.3-3), lifecycle(v.1.0.1), nlme(v.3.1-159), jsonlite(v.1.8.0), tseries(v.0.10-51), carData(v.3.0-5), viridisLite(v.0.4.1), fansi(v.1.0.3), pillar(v.1.8.1), lattice(v.0.20-45), fastmap(v.1.1.0), httr(v.1.4.4), pkgbuild(v.1.3.1), xts(v.0.12.1), remotes(v.2.4.2), leaflet.providers(v.1.9.0), png(v.0.1-7), iterators(v.1.0.14), bit(v.4.0.4), sass(v.0.4.2), class(v.7.3-20), stringi(v.1.7.8), profvis(v.0.3.7), stars(v.0.5-6), memoise(v.2.0.1) and e1071(v.1.7-11)